1 Summary of contents

2 Directory organization

The chunk below describes

3 Introduction

The analysis is split into multiple parts.

  1. A database-dependent mapping and community profiling at the SDP and Phylotype level.
  2. An assembly-based approach which involes MAG binning per sample and dereplication. Requires manual annotation of input for the next section
  3. Core genome phylogeny construction of MAGs and isolate genomes.
  1. All high-quality MAGs
  2. One MAG per magOTU cluster
  1. Mapping and SNV calling of reads to
  1. Just MAGs
  2. A combination of isolates and MAGs

The samples used in this analysis are mentioned below.

3.1 Data summary

50 samples were first mapped to a host database and then the unmapped reads to a microbiome database.

The raw reads and trimmed reads were checked for quality using the tool fastqc. The report (html format) also summarises basic statistics including number of reads. The QC results can be found in their respective folders at ./fastqc/raw/{SAMPLE}_R*_fastqc.html and ./fastqc/trim/{SAMPLE}_R*_trim_fastqc.html.

4 Data description

The dataset comprises samples from 4 species of honey bees sampled in India.

Some samples from older publications of the lab are also included for Apis mellifera and Apis cerana

The data is stored in various locations as described below and backed up on the NAS.

4.0.1 Nomenclature

There are 56 samples at the moment.

  • M1.1 - M1.5 are 5 individuals of Apis mellifera from colony 1
  • Cx.1 - Cx.5 are 5 individuals of Apis cerana from colony x for 3 colonies (1 - 3)
  • Dx.1 - Dx.5 are 5 individuals of Apis dorsata from colony x for 3 colonies (1 - 3)
  • Fx.1 - Fx.5 are 5 individuals of Apis florea from colony x for 3 colonies (1 - 3)
  • DrY2_F1 and DrY2_F2 are samples from KE’s 2015 paper. Apis mellifera from switzerland (Les Droites)
  • AcCh05, AcKn01 and AmAi02, AmIu02 are two samples of Apis cerana and Apis mellifera from different apiaries in Japan

These samples were from earlier runs:

  • 20151119_WINDU89 20151119 Kirsten_Ellegaard 6 GTF Illumina 100 PE HiSeq 2500 Genomic diversity landscape of the honey bee gut microbiota (2019, NatCom) Nurses, Year 1, Les Droites 20160415_OBIWAN225 20160415 Kirsten_Ellegaard 12 GTF Illumina 100 PE HiSeq 2500 Genomic diversity landscape of the honey bee gut microbiota (2019, NatCom) Foragers/Winterbees, Year 1, Les Droites
  • 20161216_OBIWAN275 20161216 Kirsten_Ellegaard 6 GTF Illumina 100 PE HiSeq 2500 Genomic diversity landscape of the honey bee gut microbiota (2019, NatCom) Nurses, Year 2, Les Droites
  • 20170310_WINDU179 20170310 Kirsten_Ellegaard 12 GTF Illumina 100 PE HiSeq 2500 Genomic diversity landscape of the honey bee gut microbiota (2019, NatCom) Foragers/Winterbees, Year 2, Les Droites (INCLUDED FOR NOW)
  • 20170426_OBIWAN300 20170426 Kirsten_Ellegaard 6 GTF Illumina 100 PE HiSeq 2500 Genomic diversity landscape of the honey bee gut microbiota (2019, NatCom) Nurses, Year 2, Grammont
  • 20170428_WINDU191 20170428 Kirsten_Ellegaard 12 GTF Illumina 100 PE HiSeq 2500 Genomic diversity landscape of the honey bee gut microbiota (2019, NatCom) Foragers/Winterbees, Year 2, Grammont
  • 20180118_OBIWAN338-339 20180118 Kirsten_Ellegaard 30 GTF Illumina 100 PE HiSeq 2500 Metagenomes of individual honey bees, subjected to dietary manipulation and kept in the lab
  • 20180612_KE_japan_metagenomes 20180612 Ryo_Miyasaki 40 Japan Illumina 100 PE HiSeq 2500 Vast differences in strain-level diversity in two closely related species of honey bees (2020, CurBiol) Sampling and sequencing done in Japan (INCLUDED FOR NOW)

4.1 Databases

4.1.1 Host database

The database is named 4_host_db.

A paper published in Dec. 2019 a high quality Apis dorsata genome as an improvement over a previous submission in 2013. The paper also mentioned studies that had previously sequenced the Apis florea genome in 2012, Apis cerana genome in 2015 (other assemblies submitted found here) and Apis mellifera genome in 2018 (other assemblies submitted listed here). So far I have not found any whole genome assemblies of Apis adreniformis.

These assemblies were downloaded and concatenated to make the 4_host_db. It contains,

  • >apis_mellifera_2018 PRJNA471592 version Amel_Hac3.1
  • >Apis_cerana PRJNA235974
  • >Apis_cerana_mitochondrion PRJNA235974
  • >Apis_florea PRJNA45871
  • >Apis_dorsata PRJNA174631

4.1.2 Microbiome database

The database is named genome_db_210402.

This database was set up by Dr Kirsten Ellegard (KE) and is described on zenodo. It uses NCBI and IMG genome assemblies. It is non-redundant and contains concatenated genomes. Located at in lab NAS directory at lab_resources/Genome_databse. In this pipeline so far, the version of the pipeline set up by KE’s community profiling pipeline.

It was downloaded by the script download.py --genome_db from zenodo. This dowloads multiple directories. The Orthofider directory can be deleted as this is generated for the pipeline as needed. The bed files can be generated from gff files if desired but this was already done for the genomes of that database so was not repeated. The other files (ffn, gff) are found in the public repository from where the genome was downloaded. The faa files were reorganised in directories corresponding to their respective SDPs in order to allow the Orthofinder scripts to assign orthogroups per SDP.

These repositories follow their own annotation pipeline to generate these files. The database can also be found at <NAS>/lab_resources/Genome_database/database_construction. It contains 198 genomes identified by their locus tags and described in <NAS>/lab_resources/Genome_database/database_construction/database_construction in metadata sheets.

5 Description of pipeline methods

Run entire snakemake pipeline using:

snakemake -p --use-conda --conda-prefix /work/FAC/FBM/DMF/pengel/spirit/aprasad/Miniconda3/spirit_envs --conda-frontend conda --profile slurm --restart-times 0 --keep-going

and if resuming a failed or stopped run, use:

snakemake -p --use-conda --conda-prefix /work/FAC/FBM/DMF/pengel/spirit/aprasad/Miniconda3/spirit_envs --conda-frontend conda --profile slurm --restart-times 0 --keep-going --rerun-incomplete

conda environments are all specified in envs/ and built by snakemake under various names in /work/FAC/FBM/DMF/pengel/spirit/aprasad/Miniconda3/spirit_envs

Run the pipeline in the conda environment called snakmake_with_samtools in the cluster. It is a clone of the snakemake environment made as recommended by Snakemake docs followed by conda install biopython and later conda install samtools in it. This is so that Kirsten’s core_cov script works (specific conda environments can only be specified for rules using bash).

5.1 Description of directory structure

Directory names are largely self-explanatory.

./00_rawdata, ./01_Trimmed, ./02_HostMapping, ./03_MicrobiomeMapping database contains databases to be used for mapping. It also contains ./Orthofinder files. These are described later in the sections describing associated rules. ./envs contains all yaml files required for this pipeline. They contain a list of packages needed to specify the conda environment for various rules to work within. ./logs contains log files ./scripts contains all scripts needed for the snakemake pipeline. Many of these scripts are adapted from Kirsten’s scripts from the zenodo directories, github or from the lab_resources directories. The results of the core coverage estimation are stored in, ./04_CoreCov_211018_Medgenome_india_samples ./07_SNVProfiling is not fully implemented (yet) for these samples as it is not relevant at this time. ./fastqc contains fastqc results for trimmed and raw files + bamfile_list_red.txt - required by KE’s core coverage pipeline + bamfile_list.txt - required by KE’s core coverage pipeline + Adapters-PE.fa - is generated based on index sequences by the script ./scripts/write_adapters.py (was deleted earlier as it was on scratch. Needs to be re-written.) + config.yaml - comprises information including list of samples + index_table.csv - used by the script ./scripts/write_adapters.py to make indexed adapters + Mapping_summary.csv - result from the rule summarize_mapping + rulegraph.pdf - summary DAG of rules in the pipeline (made using snakemake --forceall --rulegraph | dot -Tpdf > Figuers/rulegraph.pdf) + Report.Rmd - this report ! + Report.html - this report compiled ! + Snakefile - the pipipeline !!!

5.2 Rules

DAG of all rules in the pipeline

  • rule raw_qc
    • This rule runs fastqc on raw files and saves the output in ./fastqc/raw.
  • rule make_adapters
    • This rule uses the script _./scripts/write_adapters.py_ which was deleted earlier.
    • It uses the index_table.csv files to make the Adapters-PE.fa file containing indexed adapters.
  • rule trim
    • This rules trims reads using trimmomatic.
    • The Adapters-PE.fa files is used.
    • The trimming parameters are.
  • rule trim_qc
    • This rule runs fastqc on trimmed files and saves the output in ./fastqc/trim.
  • rule index_bwa
    • Indexes genomes in ./database for use by bwa using bwa index.
  • rule index_samtools
    • Indexes genomes in ./database for use by samtools.
  • rule make_genome_list
    • Creats a text file corresponding to each set of genomes in ./database to be used when we need to know which genomes are present in given genome database.
  • rule host_mapping
    • Uses bwa mem to map reads for each sample to a database containing host genomes, ./database/4_host_db.
    • Unmapped reads identified by samtools with the option -f4 are stored in a seperate bam file.
    • The bam file with all alignments is used later by the counting rule and then deleted after counting.
  • rule host_mapping_extract_reads
    • Reads that did not map to the host database are extracted and then mapped to the microbiome database.
    • They are extracted using picard.
    • The option -Xmx8g ensures that java is given 8 GB memory. If suffecient memory is not allocated, the job will fail.
  • rule host_mapping_count
    • Counts the number of mapped, properly and unmapped reads from host mapping.
    • It uses the following flags to identify each kind of read:
      • count number of properly mapped reads: -f 67 -F 2304
        • 67 (include -f) flags
          • read paired (0x1)
          • read mapped in proper pair (0x2)
          • first in pair (0x40)
        • 2308 (exclude -F) flags
          • read unmapped (0x4)
          • supplementary alignment (0x800)
          • not primary (0x100)
      • count number of mapped reads: -f 67 -F 2304
        • 65 (include -f) flags
          • read paired (0x1)
          • first in pair (0x40)
        • 2308 (exclude -F) flags
          • read unmapped (0x4)
          • supplementary alignment (0x800)
          • not primary (0x100)
      • count number of unmapped reads: -f 67 -F 2304
        • 69 (include -f) flags
          • read paired (0x1)
          • read unmapped (0x4)
          • first in pair (0x40)
        • 2304 (exclude -F) flags
          • supplementary alignment (0x800)
          • not primary (0x100)
    • After counting, the bam file is deleted and an empty file is touched to mark that counting is complete for said file.
  • rule microbiomedb_mapping
    • The host unmapped reads extracted earlier are mapped to the microbiome database.
    • Mapped reads are extracted using a perls script as follows. First, unmapped reads are excluded using -F4 and then supplementary reads are excluded -F 0x800. Finally, the remaining reads are sent through ./scripts/filter_sam_aln_length.pl. The script filters away reads that have less than 50bps matching in the alignment.
  • rule microbiomedb_extract_reads
    • Extracts mapped reads identified as mentioned in the previous rule and saves them as fastq files.
  • rule microbiome_mapping_count
    • Counts reads as explained in the other counting rule, host_mapping_count.
  • rule cat_and_clean_counts
    • Compiles all the counts into 1 file for easier parsing by the summarize rule.
  • rule summarize_mapping
    • Summarizes counts in a csv file using the results of earlier rules and by counting fastq files.
  • rule run_orthofinder
    • Runs Orthofinder for each phylotype.
    • Before running this, group genomes by phylotype in directories for Orthofinder to be able to get which groups to consider together. When the genomes for the database are downloaded at ./database/faa_files/{genome}, they are all in one directory. Grouping was done using ./scripts/rearange_faa.py. As written, it is to be run from the scripts directory in which it resides (!! it uses relative paths !!).
    • faa files for each genome comes from the respective databese (NCBI for example)
    • When orthofinder finishes, the following file will be generated and used for the following steps, ./database/faa_files/{phylotype}/OrthoFinder/Results_dir/Orthogroups/Orthogroups.txt.
    • The file Orthogroups.txt contains a list of orthogroups. Eg, each line would look like
      • OG0000003: C4S76_01365 C4S76_01370 C4S76_01375 C4S77_06100 C4S77_06130 C4S77_06135 C4S77_06775 C4S77_06780 C4S77_06785 C4S77_06790 C4S77_06795 C4S77_06800 C4S77_06805 C4S77_06810 C4S77_09595 C4S77_09600 C4S77_09605 C4S77_09610 C4S77_09615 C4S77_09620 C4S77_10540 Ga0307799_111506
      • where, OG0000003 is an orthogroup for this group of genomes (phylotype) and C4S77, Ga0307799 etc. are genomes that belong to that group. 09615, 09620, 10540 are genes from C4S77 and 111506 from Ga0307799 that belong to orthogroup OG0000003.
  • rule get_single_ortho
    • The files Orthogroups.txt is parsed by ./scripts/get_single_ortho.py and single-copy orthologs are written to ./database/Orthofinder/{phylotype}_single_ortho.txt
    • The script reads each orthogroup and counts the number of genomes present in genes of that orthogroup. If the number of genes in the orthogroup and the number of genomes in the orthogroup are the same as the total number of genomes in the database for said phylotype, the genes in the group are considered single-copy core genes and included for core coverage estimation.
  • rule extract_orthologs
    • This rule prepares files with sequences of orthologs in order to calculate percentage identity (perc_id).
    • First, it reads the file ./database/Orthofinder/{phylotype}_single_ortho.txt and gets all the genome-ids present in the ortholog-file, and all the gene-ids associated with each gene-family. Using this list it extracts and stores the sequences of each of the genes of an orthogroup in an faa file and ffn file corresponding to each group in the directory./database/Orthofinder/{phylotype}_ortho_sequences/. EXAMPLE:
    • cat ./database/Orthofinder/firm5_ortho_sequences/OG0001034.faa
      >Ga0061073_1479
      MTKYQTLIFVPEGSLLNEKTAEQVALRQTLKELGHDFGPAERLKYSSLQGQVKMMGFSER
      IALTLQNFCTDDLAEAEKIFKTKLGGQRQLVKDAIPFLDQITNQVKLILLAKEERELISA
      RLSDSELLNYFSASYFKEDFADPLPNKNVLFQIIKEQELDPDNCLVIGTDLVEEIQGAEN
      AGLQSLWIAPKKVKMPISPRPTLHLTKLNDLLFYLELN
      >Ga0070887_12184
      MKGKVHLAKYETLIFILEGSLLNEKVAEQNALRQTLKLTGREYGPAERIQYNSLQEKIKL
      LGFDERIKLTLQEFFKNDWISAKGTFYNQLQKQDQLNKDVVPFLDEVKNKVNLVLLTKEK
      KDVASYRMQNTELINYFSAVYFKDDFACKFPNKKVLITILQQQNLTPATCLVIGTNLVDE
      IQGAENANLDSLWLAPKKVKMPISPRPTLHLNKLTDLLFYLELS
      >Ga0072400_11133
      LAKFQTLIFILEGSLLDEKIAEQSALKQTLKSTGRDFGPSERLKYNSVRENNKLLGFEDR
      IQLILQTFFHENWQDAGQIFIKELQKQNRLNKEVLPFLNKVNCKVKLILLAKENKKVALQ
      RMKNTELVNYFPFAYFKDDFTEKLPHKKVLTTILQKQNLAFATSLVIGTDLADEIQAAEN
      AKIQSLWLAPKKVKMPISPHPTLHLNKLNDLLFYLELS
    • cat ./database/Orthofinder/firm5_ortho_sequences/OG0001034.ffa
      >Ga0061073_1479
      GTGACTAAATATCAAACGTTAATTTTTGTTCCTGAAGGTAGTTTATTAAATGAAAAAACG
      GCTGAACAAGTCGCACTCAGGCAAACTTTAAAAGAACTCGGACATGATTTTGGACCAGCT
      GAACGCCTAAAATATTCTAGCTTACAAGGACAAGTTAAAATGATGGGTTTCAGCGAGCGC
      ATTGCACTAACCCTGCAAAATTTTTGTACCGACGATTTGGCTGAGGCCGAAAAAATTTTC
      AAAACAAAATTAGGAGGTCAGCGACAACTAGTCAAAGATGCTATTCCATTTCTTGACCAA
      ATAACAAACCAAGTTAAGCTAATTCTCCTTGCCAAAGAAGAACGTGAACTAATCTCAGCT
      CGCCTATCTGATAGCGAACTACTTAACTATTTTTCTGCTTCCTATTTTAAAGAAGATTTT
      GCTGATCCTTTGCCAAATAAAAATGTCCTGTTTCAAATTATAAAAGAGCAAGAATTAGAT
      CCAGATAATTGCCTAGTTATCGGCACAGATTTAGTTGAAGAAATTCAAGGAGCAGAAAAC
      GCTGGCTTGCAATCATTATGGATTGCACCAAAAAAGGTTAAAATGCCAATTAGTCCTCGA
      CCTACTCTGCATTTAACTAAACTCAATGACTTGCTTTTTTATCTTGAATTAAACTAG
      >Ga0070887_12184
      ATGAAAGGAAAAGTACACTTGGCAAAATATGAAACTTTAATTTTTATTCTTGAAGGAAGC
      TTATTAAACGAAAAAGTTGCAGAACAAAATGCACTTAGGCAAACTTTGAAATTAACTGGC
      AGAGAATATGGTCCAGCTGAGCGCATACAATATAATTCATTACAAGAAAAGATTAAATTA
      CTAGGATTTGATGAGCGCATTAAATTAACTTTGCAGGAATTCTTTAAAAATGACTGGATT
      TCTGCGAAAGGCACTTTTTATAACCAGTTGCAAAAACAAGATCAGTTAAATAAAGATGTA
      GTGCCCTTTTTAGATGAGGTGAAAAACAAAGTTAACTTGGTTTTGCTGACGAAAGAGAAA
      AAAGATGTGGCTTCATACCGCATGCAAAATACAGAGCTAATAAATTATTTTTCCGCAGTT
      TATTTTAAAGACGATTTTGCATGTAAGTTTCCAAATAAGAAGGTTTTGATAACAATATTG
      CAGCAGCAGAATCTGACGCCAGCCACTTGTCTTGTAATTGGGACAAACTTAGTCGATGAA
      ATTCAGGGTGCCGAAAATGCTAACCTGGATTCTCTTTGGCTAGCGCCCAAGAAAGTAAAA
      ATGCCAATTAGTCCACGTCCAACTTTACATTTAAATAAATTAACTGATTTATTATTTTAC
      CTAGAATTAAGCTAG
      >Ga0072400_11133
      TTGGCAAAATTTCAAACATTAATTTTTATTCTTGAGGGCAGTTTATTAGATGAAAAGATT
      GCTGAACAAAGTGCATTAAAGCAAACTTTAAAGTCAACTGGCAGAGATTTTGGTCCCAGT
      GAACGTTTAAAATATAATTCTGTACGAGAAAATAATAAGTTGCTTGGCTTTGAAGACCGC
      ATACAATTAATTTTACAAACATTTTTTCATGAAAATTGGCAAGATGCAGGGCAGATTTTT
      ATCAAAGAATTACAAAAGCAAAATCGCTTGAATAAAGAAGTATTGCCATTTTTAAACAAA
      GTTAACTGCAAGGTTAAACTAATTCTGCTGGCAAAAGAGAACAAAAAAGTAGCATTACAG
      CGCATGAAGAACACAGAGTTGGTAAATTATTTTCCGTTTGCTTATTTTAAAGATGACTTT
      ACGGAAAAATTGCCACATAAAAAAGTTTTGACCACCATTTTGCAGAAACAAAACTTGGCG
      TTCGCAACTAGTTTAGTAATCGGAACTGACTTAGCAGATGAAATTCAGGCTGCAGAGAAT
      GCCAAAATACAGTCACTCTGGCTAGCGCCTAAGAAAGTAAAAATGCCGATTAGCCCGCAC
      CCAACTTTACATTTAAATAAATTAAACGATTTATTATTTTACCTAGAATTAAGCTAG
  • rule calc_perc_id
    • this rule relies on various tools and scripts tied together by ./scripts/aln_calc.sh. The scripts are:
      • mafft for alignment
        • The aligned result is in a multi-fasta file called {OrthogroupID}_aln.fasta. EXAMPLE:
        • cat ./database/Orthofinder/firm5_ortho_sequences/OG0001034_aln.fasta
          >Ga0061073_1479
          ——-MTKYQTLIFVPEGSLLNEKTAEQVALRQTLKELGHDFGPAERLKYSSLQGQVK
          MMGFSERIALTLQNFCTDDLAEAEKIFKTKLGGQRQLVKDAIPFLDQIT—NQVKLILL
          AKEERELISARLSDSELLNYFSASYFKEDFADPLPNKNVLFQIIKEQELDPDNCLVIGTD
          LVEEIQGAENAGLQSLWIAPKKVKMPISPRPTLHLTKLNDLLFYLELN
          >Ga0070887_12184
          M-KGKVHLAKYETLIFILEGSLLNEKVAEQNALRQTLKLTGREYGPAERIQYNSLQEKIK
          LLGFDERIKLTLQEFFKNDWISAKGTFYNQLQKQDQLNKDVVPFLDEVK—NKVNLVLL
          TKEKKDVASYRMQNTELINYFSAVYFKDDFACKFPNKKVLITILQQQNLTPATCLVIGTN
          LVDEIQGAENANLDSLWLAPKKVKMPISPRPTLHLNKLTDLLFYLELS
          >Ga0072400_11133
          ——-LAKFQTLIFILEGSLLDEKIAEQSALKQTLKSTGRDFGPSERLKYNSVRENNK
          LLGFEDRIQLILQTFFHENWQDAGQIFIKELQKQNRLNKEVLPFLNKVN—CKVKLILL
          AKENKKVALQRMKNTELVNYFPFAYFKDDFTEKLPHKKVLTTILQKQNLAFATSLVIGTD
          LADEIQAAENAKIQSLWLAPKKVKMPISPHPTLHLNKLNDLLFYLELS
      • aln_aa_to_dna.py
        • This scripts converts the alignments into nucleotide sequences rather than amino acid sequences EXAMPLE:
        • cat ./database/Orthofinder/firm5_ortho_sequences/OG0001034_aln.fasta >Ga0061073_1479
          ———————GTGACTAAATATCAAACGTTAATTTTTGTTCCTGAAGGT
          AGTTTATTAAATGAAAAAACGGCTGAACAAGTCGCACTCAGGCAAACTTTAAAAGAACTC
          GGACATGATTTTGGACCAGCTGAACGCCTAAAATATTCTAGCTTACAAGGACAAGTTAAA
          ATGATGGGTTTCAGCGAGCGCATTGCACTAACCCTGCAAAATTTTTGTACCGACGATTTG
          GCTGAGGCCGAAAAAATTTTCAAAACAAAATTAGGAGGTCAGCGACAACTAGTCAAAGAT
          GCTATTCCATTTCTTGACCAAATAACA———AACCAAGTTAAGCTAATTCTCCTT
          GCCAAAGAAGAACGTGAACTAATCTCAGCTCGCCTATCTGATAGCGAACTACTTAACTAT
          TTTTCTGCTTCCTATTTTAAAGAAGATTTTGCTGATCCTTTGCCAAATAAAAATGTCCTG
          TTTCAAATTATAAAAGAGCAAGAATTAGATCCAGATAATTGCCTAGTTATCGGCACAGAT
          TTAGTTGAAGAAATTCAAGGAGCAGAAAACGCTGGCTTGCAATCATTATGGATTGCACCA
          AAAAAGGTTAAAATGCCAATTAGTCCTCGACCTACTCTGCATTTAACTAAACTCAATGAC
          TTGCTTTTTTATCTTGAATTAAAC
          >Ga0070887_12184
          ATG—AAAGGAAAAGTACACTTGGCAAAATATGAAACTTTAATTTTTATTCTTGAAGGA
          AGCTTATTAAACGAAAAAGTTGCAGAACAAAATGCACTTAGGCAAACTTTGAAATTAACT
          GGCAGAGAATATGGTCCAGCTGAGCGCATACAATATAATTCATTACAAGAAAAGATTAAA
          TTACTAGGATTTGATGAGCGCATTAAATTAACTTTGCAGGAATTCTTTAAAAATGACTGG
          ATTTCTGCGAAAGGCACTTTTTATAACCAGTTGCAAAAACAAGATCAGTTAAATAAAGAT
          GTAGTGCCCTTTTTAGATGAGGTGAAA———AACAAAGTTAACTTGGTTTTGCTG
          ACGAAAGAGAAAAAAGATGTGGCTTCATACCGCATGCAAAATACAGAGCTAATAAATTAT
          TTTTCCGCAGTTTATTTTAAAGACGATTTTGCATGTAAGTTTCCAAATAAGAAGGTTTTG
          ATAACAATATTGCAGCAGCAGAATCTGACGCCAGCCACTTGTCTTGTAATTGGGACAAAC
          TTAGTCGATGAAATTCAGGGTGCCGAAAATGCTAACCTGGATTCTCTTTGGCTAGCGCCC
          AAGAAAGTAAAAATGCCAATTAGTCCACGTCCAACTTTACATTTAAATAAATTAACTGAT
          TTATTATTTTACCTAGAATTAAGC
          >Ga0072400_11133
          ———————TTGGCAAAATTTCAAACATTAATTTTTATTCTTGAGGGC
          AGTTTATTAGATGAAAAGATTGCTGAACAAAGTGCATTAAAGCAAACTTTAAAGTCAACT
          GGCAGAGATTTTGGTCCCAGTGAACGTTTAAAATATAATTCTGTACGAGAAAATAATAAG
          TTGCTTGGCTTTGAAGACCGCATACAATTAATTTTACAAACATTTTTTCATGAAAATTGG
          CAAGATGCAGGGCAGATTTTTATCAAAGAATTACAAAAGCAAAATCGCTTGAATAAAGAA
          GTATTGCCATTTTTAAACAAAGTTAAC———TGCAAGGTTAAACTAATTCTGCTG
          GCAAAAGAGAACAAAAAAGTAGCATTACAGCGCATGAAGAACACAGAGTTGGTAAATTAT
          TTTCCGTTTGCTTATTTTAAAGATGACTTTACGGAAAAATTGCCACATAAAAAAGTTTTG
          ACCACCATTTTGCAGAAACAAAACTTGGCGTTCGCAACTAGTTTAGTAATCGGAACTGAC
          TTAGCAGATGAAATTCAGGCTGCAGAGAATGCCAAAATACAGTCACTCTGGCTAGCGCCT
          AAGAAAGTAAAAATGCCGATTAGCCCGCACCCAACTTTACATTTAAATAAATTAAACGAT
          TTATTATTTTACCTAGAATTAAGC
      • trim_aln.py and sed to simplify headers to contain just genome ID and leave out gene identifier (as they are all single copy core genes).
        • This script trims out all the sections that do not align by counting which positions have “-” and removing those from all the members of the orthogroup.
        • cat ./database/Orthofinder/firm5_ortho_sequences/OG0001034_aln_trim.fasta
          >Ga0061073
          GTGACTAAATATCAAACGTTAATTTTTGTTCCTGAAGGTAGTTTATTAAATGAAAAAACG
          GCTGAACAAGTCGCACTCAGGCAAACTTTAAAAGAACTCGGACATGATTTTGGACCAGCT
          GAACGCCTAAAATATTCTAGCTTACAAGGACAAGTTAAAATGATGGGTTTCAGCGAGCGC
          ATTGCACTAACCCTGCAAAATTTTTGTACCGACGATTTGGCTGAGGCCGAAAAAATTTTC
          AAAACAAAATTAGGAGGTCAGCGACAACTAGTCAAAGATGCTATTCCATTTCTTGACCAA
          ATAACAAACCAAGTTAAGCTAATTCTCCTTGCCAAAGAAGAACGTGAACTAATCTCAGCT
          CGCCTATCTGATAGCGAACTACTTAACTATTTTTCTGCTTCCTATTTTAAAGAAGATTTT
          GCTGATCCTTTGCCAAATAAAAATGTCCTGTTTCAAATTATAAAAGAGCAAGAATTAGAT
          CCAGATAATTGCCTAGTTATCGGCACAGATTTAGTTGAAGAAATTCAAGGAGCAGAAAAC
          GCTGGCTTGCAATCATTATGGATTGCACCAAAAAAGGTTAAAATGCCAATTAGTCCTCGA
          CCTACTCTGCATTTAACTAAACTCAATGACTTGCTTTTTTATCTTGAATTAAAC
          >Ga0070887
          TTGGCAAAATATGAAACTTTAATTTTTATTCTTGAAGGAAGCTTATTAAACGAAAAAGTT
          GCAGAACAAAATGCACTTAGGCAAACTTTGAAATTAACTGGCAGAGAATATGGTCCAGCT
          GAGCGCATACAATATAATTCATTACAAGAAAAGATTAAATTACTAGGATTTGATGAGCGC
          ATTAAATTAACTTTGCAGGAATTCTTTAAAAATGACTGGATTTCTGCGAAAGGCACTTTT
          TATAACCAGTTGCAAAAACAAGATCAGTTAAATAAAGATGTAGTGCCCTTTTTAGATGAG
          GTGAAAAACAAAGTTAACTTGGTTTTGCTGACGAAAGAGAAAAAAGATGTGGCTTCATAC
          CGCATGCAAAATACAGAGCTAATAAATTATTTTTCCGCAGTTTATTTTAAAGACGATTTT
          GCATGTAAGTTTCCAAATAAGAAGGTTTTGATAACAATATTGCAGCAGCAGAATCTGACG
          CCAGCCACTTGTCTTGTAATTGGGACAAACTTAGTCGATGAAATTCAGGGTGCCGAAAAT
          GCTAACCTGGATTCTCTTTGGCTAGCGCCCAAGAAAGTAAAAATGCCAATTAGTCCACGT
          CCAACTTTACATTTAAATAAATTAACTGATTTATTATTTTACCTAGAATTAAGC
          >Ga0072400
          TTGGCAAAATTTCAAACATTAATTTTTATTCTTGAGGGCAGTTTATTAGATGAAAAGATT
          GCTGAACAAAGTGCATTAAAGCAAACTTTAAAGTCAACTGGCAGAGATTTTGGTCCCAGT
          GAACGTTTAAAATATAATTCTGTACGAGAAAATAATAAGTTGCTTGGCTTTGAAGACCGC
          ATACAATTAATTTTACAAACATTTTTTCATGAAAATTGGCAAGATGCAGGGCAGATTTTT
          ATCAAAGAATTACAAAAGCAAAATCGCTTGAATAAAGAAGTATTGCCATTTTTAAACAAA
          GTTAACTGCAAGGTTAAACTAATTCTGCTGGCAAAAGAGAACAAAAAAGTAGCATTACAG
          CGCATGAAGAACACAGAGTTGGTAAATTATTTTCCGTTTGCTTATTTTAAAGATGACTTT
          ACGGAAAAATTGCCACATAAAAAAGTTTTGACCACCATTTTGCAGAAACAAAACTTGGCG
          TTCGCAACTAGTTTAGTAATCGGAACTGACTTAGCAGATGAAATTCAGGCTGCAGAGAAT
          GCCAAAATACAGTCACTCTGGCTAGCGCCTAAGAAAGTAAAAATGCCGATTAGCCCGCAC
          CCAACTTTACATTTAAATAAATTAAACGATTTATTATTTTACCTAGAATTAAGC
      • calc_perc_id_orthologs.py
        • Uses as input, trimmed aligned sequences and a metafile (./database/genome_db_210402_metafile.txt) which is a tab-delim file with genome-id in tab1 and SDP-affiliation in tab 3
        • First, it checks the number of SDPs contained within the alignment. If more than one, it continues by calculating alignment percentage identity stats across SDPs. If only one SDP, exits script.
        • Next, it Compares the genomes in each SDP to all other genomes in the alignment: calculates percentage identity for all pairwise combinations. Calculates the max, min, and mean values, prints to file ./database/Orthofinder/{phylotype}_perc_id.txt showing one orthogroup per line. EXAMPLE:
        • cat ./database/Orthofinder/firm5_perc_id.txt

          OG0001034 0.674 0.586 0.972

  • rule filter_orthogroups
    • orthogroups are filtered based on:
      • Minimum gene-length 300bp (applied to all members of each gene-family)
      • Inter-SDP max alignment identity 95% (only if the phylotype contain multiple SDPs)
    • Short genes are filtered off, because they are likley to be less reliable for accurate coverage estimates. Similarly, the inter-SDP similarity threshold is used to ensure that there is enough divergence between the SDPs for reliable mapping (at least as estimated from the currently availabe genomes). It is worthwhile to check the number of gene-families before/after filtering. If a lot of gene-families were filtered off, this could be an indication that the SDPs are not properly discrete.
    • This finally results in the single-copy core genes that have been filtered to be used for core coverage estimation. ./database/Orthofinder/{phylotype}_single_ortho_filt.txt.
  • rule core_cov
    • takes as input bam files with the alignments for each sample to be considered (as a text file containing a list of these files) and the _single_ortho_filt file. Outputs are written to ./04_CoreCov_"+ProjectIdentifier+"/{phylotype}_corecov.txt.
    • The script reads the filtered orthofile ./database/Orthofinder/{phylotype}_single_ortho_filt.txt and gets the gene-famililes and genome-ids for each SDP.
    • Then, from bed files, it finds the start and end positions of each of the genes in an orthogroup for each of the genomes of the orthogroup. It writes these to the file, ./04_CoreCov_*/{phylotype}_corecov.txt each SDP in the phylotype, start position for each gene family in the genome marked reference for that SDP.
    • The coverage is also written to this file. Example:
      • SDP Sample OG Ref_pos Coverage
        firm5_1 DrY1_N1_microbiome_mapped OG0000932 448 18.81
        firm5_1 DrY1_N1_microbiome_mapped OG0000931 1991 23.34

        firm5_2 M1.5_microbiome_mapped OG0000935 1852405 12.29
        firm5_2 M1.5_microbiome_mapped OG0000934 1853270 9.95
        firm5_3 DrY1_N1_microbiome_mapped OG0000932 1 501.61
        firm5_3 DrY1_N1_microbiome_mapped OG0000931 1542 534.77

        firm5_bombus M1.5_microbiome_mapped OG0000936 1674767 0.0
        firm5_bombus M1.5_microbiome_mapped OG0000935 1676256 0.0
        firm5_bombus M1.5_microbiome_mapped OG0000934 1677124 0.0
    • SDP abundance is estimated based on mapped read coverage of core genes. It sums up gene coverages of all the genes og OG families associated with said SDP across genomes belogining to the SDP.
    • It also reports PTR (Peak-Trough Ratio).
    • Most species in the database are represented by multiple genomes (< 98.5% gANI between genomes). Core genes are inferred at the phylotype. More accurate estimates can be obtained by using a large number (+700) of core genes.
  • rule core_cov_plots
    • This R-script will estimate the coverage at the terminus, using the summed core gene family coverages. If the cov-ter cannot be properly estimated (fx. due to draft genome status or lack of replication), an estimate will be generated using the median coverage across core gene families, and the PTR is set to NA. If more than 20% of the core gene families have no coverage, the abundance will be set to zero. As output, a tabular file is generated (including the cov-ter/median cov, and PTR), and a pdf-file with plots for visual validation.
    • First, filter for samples with coverage of at least 1 on > 80% of the core genes. Next, values that are deviating no more than 2 times the median are kept others are discarded as outliers.
    • Next, gets fitted coordinates and append values to coord-table. It does this by using the segmented package. As explained below,

where, lin.mod is a simple linear model that was made by base R. The R package Segmented supports breakpoint analysis. The methods used by this package are applicable when segments are (nearly continuous) so this means that for the regression to make sense the core gene families selected should cover the reference genome well and without too many huge gaps. psi, is a starting value of the breakpoint. Example of a model fit using segemented,

x is the slope of the first segment and U1.x is the difference in slopes between the first and second segment. psi_est is the newly estimated breakpoint. This along with the slopes - The summary function shows:

Finally, from the segmented model the ptr is calculated as follows:

For cor_ter is the coverage \(y = ax + b\) where x is psi (the breakpoint on the x axis) and the cor_ori2 is the coverage at the ori which is the section with maximum coverage and at the tail. The condition (psi<psi_min) || (psi>psi_max) || (min_ori_cov<cov_ter) checks: First; If the break-point is too far from the expected place (+/-50% of break-point estimate), ptr is set to NA. Second; If the coverage at ori (either beginning or end of dataframe) is lower than the estimated coverage at ter, ptr is also set to NA. Finally, if the coverage of the origin is not greater than the terminus, ptr is set to NA.

  • the PTR was set to NA, the median will be plotted and used for quantification. Else, the segmented regression line is plotted, and the terminus coverage is used for quantification.

  • rule assemble_host_unmapped

    • Takes as input the R1 and R2 reads that were not mapped to the host and assembles them using spades with the --meta tag and default parameters.
    • Memory allocation is not obvious. More documentation on this soon.
  • rule map_to_assembly

    • Map reads that were assembled against the contigs that they were assembled into using bwa mem.
  • rule cat_and_clean_counts_assembly

    • Compiles counts into one file for summarizing
  • rule summarize_mapping_assembly

    • similar to earlier rule “summarize_mapping”
  • rule backmapping

    • NxN mapping for
  • rule merge_depths

  • rule binning

  • rule process_metabat2

  • rule checkm_evaluation

  • rule prepare_info_for_drep

  • rule drep

  • rule gtdb_annotate

  • rule compile_report

  • rule backup

  • rule onsuccess

5.3 Scripts

  • aln_aa_to_dna.py
  • aln_calc.sh
  • calc_perc_id_orthologs.py
  • core_cov.py
  • core_cov.R
  • download_data.py
  • extract_orthologs.py
  • fasta_generate_regions.py
  • filter_bam.py
  • filter_orthologs.py
  • filter_sam_aln_length.pl
  • filter_sam_aln_length_unmapped.pl
  • filter_snvs.pl
  • filt_vcf_samples.pl
  • get_single_ortho.py
  • parse_core_cov.py
  • parse_spades_metagenome.pl
  • rearange_faa.py
  • subset_orthofile.py
  • trim_aln.py
  • ./scripts/write_adapters.py

5.4 Envs

core-cov-env.yaml

mapping-env.yaml

rmd-env.yaml

snv-env.yaml

trim-qc-env.yaml

6 Next steps

It is clear that the database is not best suited for some SDPs found especially in host species other than Apis mellifera. So, the next step would be to implement a MAG based analysis to compare these samples. However, as the database was already shown to be well-suited for Apis mellifera and Apis cerana, another set of analysis would compare these samples from the publication (zenodo) with the samples from India.

6.1 Choosing representative genomes

Information in drep documentation.

\[A * Completeness - B * Contamination + C * (Contamination * frac{strainheterogeneity}{100}) + D * log(N50) + E * log(size) + F * (centrality - S_{a}ni) \]s

“It makes no sense to perform bootstrap with less than 4 sequences” from IQTree

7 Results

7.1 Total number of reads before and after trimming

Total reads by species

7.2 Total number of reads per species after trimming

Total reads by species

7.3 Proportion of reads mapping to host and microbiome database

Proportion of reads mapped to the microbiome database and host database

7.4 Micorbiome reads lost by non_specific mapping to host

Proportion of reads mapped to the microbiome database and host database

7.5 Community profiling with mOTUs

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name            grob
## 1 1 (1-1,1-1) arrange gtable[arrange]
## 2 2 (2-2,1-1) arrange  gtable[layout]

mOTUs Summary

8 MAG analysis

8.1 Mapping to Assembly

The host-filtered reads were assembled. The proportion of reads that mapped back to the assembly and some information about the assemblies such as assembly size, and information about contigs are shown below.

## TableGrob (3 x 1) "arrange": 3 grobs
##   z     cells    name              grob
## 1 1 (1-1,1-1) arrange   gtable[arrange]
## 2 2 (2-2,1-1) arrange gtable[guide-box]
## 3 3 (3-3,1-1) arrange   gtable[arrange]

8.2 contig length and fate (binning outcome)

8.3 MAGs prevalence and abundance

Mean of mean contig coverage (calculated for metabat2) as a proxy for abundance of MAG in it’s own sample.

Total MAGs recovered - distribution of quality / total + per sample

There are a total of 1087 MAGs present. There were 246 high quality MAGs (> 95% complete and < 5% redundant) and 534 with > 50% completion and < 10% redundancy.

8.4 MAG taxonomy affiliation per sample

MAG quality summarised

The common threshold (thumb-rule) of >70% completeness and contamination <5% along with N50 >10Kb (N50 threshold does not exclude any genomes) seems appropriate as it does not exclude too many MAGs. Completeness is currently the most powerful criterion.

There are a total of 260 clusters. They represent 19 families. Only 95 of them are represented by MAGs that cross the threshold comprising 17 families.

The passed MAGs include 26 out of 32 genera. I exclude the following because they only contain 1 MAG:

33 out of 41 species.

## 
## Call:
## glm(formula = number_of_clusters ~ Host, family = "poisson", 
##     data = vis_magOTUs_df_numClusters_all)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3825  -0.9929  -0.1208   0.3363   3.0240  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.91957    0.05998  48.679  < 2e-16 ***
## HostApis dorsata    0.44080    0.07689   5.733 9.87e-09 ***
## HostApis florea    -1.25817    0.12750  -9.868  < 2e-16 ***
## HostApis mellifera  0.04570    0.11793   0.388    0.698    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 360.675  on 49  degrees of freedom
## Residual deviance:  90.223  on 46  degrees of freedom
## AIC: 320.56
## 
## Number of Fisher Scoring iterations: 5
## 
## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  vis_magOTUs_df_numClusters_all$number_of_clusters and vis_magOTUs_df_numClusters_all$Host 
## 
##                Apis cerana Apis dorsata Apis florea
## Apis dorsata   0.0018      -            -          
## Apis florea    1.2e-05     1.2e-05      -          
## Apis mellifera 0.9650      0.0472       0.0018     
## 
## P value adjustment method: fdr 
## 
## Call:
## glm(formula = number_of_clusters ~ Host, family = "poisson", 
##     data = vis_magOTUs_df_numClusters)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5230  -0.6147  -0.1130   0.6001   2.1680  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          2.2892     0.0822  27.849  < 2e-16 ***
## HostApis dorsata     0.4624     0.1049   4.406 1.05e-05 ***
## HostApis florea     -1.1260     0.1661  -6.779 1.21e-11 ***
## HostApis mellifera  -0.1144     0.1717  -0.666    0.505    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 195.145  on 49  degrees of freedom
## Residual deviance:  60.018  on 46  degrees of freedom
## AIC: 261.24
## 
## Number of Fisher Scoring iterations: 4
## 
## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  vis_magOTUs_df_numClusters$number_of_clusters and vis_magOTUs_df_numClusters$Host 
## 
##                Apis cerana Apis dorsata Apis florea
## Apis dorsata   0.0164      -            -          
## Apis florea    2.0e-05     2.1e-05      -          
## Apis mellifera 0.2268      0.0419       0.0022     
## 
## P value adjustment method: fdr
## [1] 1
## [1] 1
## 
## Call:
## anosim(x = df_magOTUs_vegan, grouping = df_meta$SpeciesID, permutations = 9999,      distance = "jaccard") 
## Dissimilarity: jaccard 
## 
## ANOSIM statistic R: 0.9945 
##       Significance: 1e-04 
## 
## Permutation: free
## Number of permutations: 9999